The sharing economy has been a hot topic in recent years. Airbnb, the leader in home sharing, is a good example of how it improves people's lives and makes booking accommodation much easier than before. In this project, we study the Airbnb host listings in Washington, DC. The project is useful for tourists deciding on accommodation when they plan a trip to DC, and the results are also useful for hosts who want to improve the performance of their listings. We generate maps to study some characteristics of Airbnb listings in DC, and we investigate the factors associated with the popularity of a property listed on Airbnb.
Nowadays, more and more people use Airbnb to save money, make friends and, most importantly, experience a city like a local. At the same time, more rooms are being cleaned, decorated and rented out for extra income or just for fun. With our study of the Airbnb data and the resulting visualizations, we hope that people can use Airbnb more efficiently, maximizing their benefit both as travelers and as hosts.
Our project offers useful information for both hosts and guests in Washington, DC. Hosts can use it to improve their rooms and align them with customer demand in order to raise profit or increase popularity. Guests gain a broader view of the housing market in Washington. Our study reveals some popular regions that people like to stay in, which helps guests better prepare for their trip to DC.
The dataset we use is from Inside Airbnb (http://insideairbnb.com/get-the-data.html), a website that provides publicly available data from Airbnb. We chose this dataset because it contains well-organized, comprehensive data that can be used directly in data analysis. We focus on all the rooms offered in Washington, DC because it is a popular tourist city with large potential for Airbnb development and expansion. The fields we would like to focus on are:
# Import all the necessary packages
import csv
import pandas as pd
import numpy as np
import nltk
import string
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import random
import folium
from folium.plugins import MarkerCluster, ScrollZoomToggler, HeatMap
import json
import os
import statsmodels.stats.api as sm
from statsmodels.formula.api import ols
import statsmodels.graphics
import statsmodels.formula.api as smf
from jinja2 import Template
from branca.element import MacroElement
import branca.colormap as cm
import folium.plugins
import matplotlib.axes as ax
%matplotlib inline
# Read the listings CSV file, using the first column as the index
df = pd.read_csv('wdc_listings.csv', index_col=0)
df.head()
## Modify the data set
# Share of the year in which a listing is not available, used as the occupancy rate
df['occupied_percent'] = (365 - df['availability_365']) / 365 * 100

def isType(entry, type_want):
    """Encode a value as a dummy variable.
    Arguments: entry is the value to test; type_want is the value that counts as true.
    Returns the integer 1 if entry equals type_want, otherwise 0."""
    if entry == type_want:
        return 1
    else:
        return 0

def changeAmenity(entry):
    """Turn a brace-delimited, comma-separated amenities string into a nested list of names."""
    entry = str(entry)
    entry = entry.strip('{')
    entry = entry.strip('}')
    ame_list = entry.split(',')
    # Kept as a list inside a list because the amenity counts further down flatten two levels
    ame_list_new = [[i.strip('"') for i in ame_list]]
    return ame_list_new
df['host_is_superhost'] = [isType(i, 't') for i in df['host_is_superhost']]
df['host_has_profile_pic'] = [isType(i, 't') for i in df['host_has_profile_pic']]
df['host_identity_verified'] = [isType(i, 't') for i in df['host_identity_verified']]
df['entire_home'] = [isType(i, 'Entire home/apt') for i in df['room_type']]
df['private_room'] = [isType(i, 'Private room') for i in df['room_type']]
df['require_guest_phone_verification'] = [isType(i, 't') for i in df['require_guest_phone_verification']]
df['cancel_strict'] = [isType(i, 'strict') for i in df['cancellation_policy']]
df['instant_bookable'] = [isType(i, 't') for i in df['instant_bookable']]
df['host_response_rate'] = [ float(str(i).strip('%')) for i in df['host_response_rate']]
df['host_acceptance_rate'] = [ float(str(i).strip('%')) for i in df['host_acceptance_rate']]
df['cleaning_fee'] = np.nan_to_num(df['cleaning_fee'])
df['security_deposit'] = np.nan_to_num(df['security_deposit'])
df['amenities'] = [changeAmenity(i) for i in df['amenities']]
# Mean occupancy rate per neighbourhood
pop = df.groupby('neighbourhood_cleansed')['occupied_percent'].mean().reset_index()
#pop.head()
# Mean price per neighbourhood, highest first
pop_price = df.groupby('neighbourhood_cleansed')['price'].mean()
pop_price = pop_price.sort_values(ascending=False).reset_index()
# Number of listings per neighbourhood
pop_count = df.groupby('neighbourhood_cleansed')['name'].count().reset_index()
#pop_count
list_loc = [[row['latitude'], row['longitude'], row['occupied_percent'], row['price'], row['neighbourhood_cleansed']] for index, row in df.iterrows()]
### numbers of listings
DC_COORDINATES = (38.9072, -77.0369)
dc = os.getcwd() + '/wdc.geojson'
m = folium.Map(DC_COORDINATES, zoom_start = 12)
marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)
for location in list_loc[:]:
    folium.Marker(location = location[:2], icon=folium.Icon(color='red'),
                  popup = str(location[3])).add_to(m).add_to(marker_cluster)
m.choropleth(geo_str = open(dc).read(),
             data=pop_count,
             columns=['neighbourhood_cleansed', 'name'],
             key_on='properties.neighbourhood',
             fill_color='GnBu',
             threshold_scale=[0, 30, 60, 100, 250, 500],
             fill_opacity = 0.85,
             line_weight = 1.2,
             legend_name = "Number of listings in a District")
toggler = ScrollZoomToggler().add_to(m)
m
We want to display how Airbnb listings and their prices are distributed within the DC region. To make this easy to see, we made a DC map and used different colors to represent the density of listings in each neighborhood: dark blue represents high density and light green represents low density. Each marker represents one Airbnb listing, and the number on the marker shows the daily price of that listing. As the map shows, Airbnb housing is concentrated in the central DC area, especially the region north of the White House and the Lincoln Memorial, where the map shows 1312 listings. Looking at the prices within each neighborhood, we found that in the central area the price ranges from 200 to 400 dollars, while in the outer areas the daily prices are relatively lower, at less than 100 dollars. The following bar chart shows the average price for each neighborhood more clearly.
sns.color_palette("Blues")
sns.barplot(y='neighbourhood_cleansed', x='price', data =pop_price.head(10))
plt.title('Neighbourhood With Highest Airbnb Listing Price')
plt.xlabel('Price ($)')
plt.ylabel('Neighbourhood')
plt.show()
As the bar chart above shows, the "Georgetown, Burleith/Hillandale" region has the highest average daily price for Airbnb listings, while the "Shaw, Logan Circle" region has the lowest average daily price among the ten neighbourhoods shown. This result fits the conclusion we drew from the map above.
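The ranking behind such a bar chart is a group-wise mean followed by a sort. A minimal sketch in plain Python, assuming made-up prices; the function name `mean_price_by_neighbourhood` and the sample rows are hypothetical and only mirror what the pandas `groupby` in this notebook does:

```python
from collections import defaultdict

def mean_price_by_neighbourhood(rows):
    """rows: iterable of (neighbourhood, price) pairs.
    Returns a list of (neighbourhood, mean price) sorted from high to low."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, price in rows:
        totals[name] += price
        counts[name] += 1
    means = {name: totals[name] / counts[name] for name in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Made-up listings, not taken from the dataset
listings = [('Georgetown', 200), ('Georgetown', 300),
            ('Shaw', 100), ('Shaw', 120), ('Shaw', 80)]
ranking = mean_price_by_neighbourhood(listings)
print(ranking)
```

Taking the first ten entries of such a ranking gives the neighbourhoods plotted in a top-10 chart.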
### occupied rate(measuring popularity)
DC_COORDINATES = (38.9072, -77.0369)
dc = os.getcwd() + '/wdc.geojson'
m = folium.Map(DC_COORDINATES, zoom_start = 12)
marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)
for location in list_loc[:]:
    folium.Marker(location = location[:2], popup = str(location[2])).add_to(m).add_to(marker_cluster)
m.choropleth(geo_str = open(dc).read(),
             data=pop,
             columns=['neighbourhood_cleansed', 'occupied_percent'],
             key_on='properties.neighbourhood',
             fill_color='BuPu',
             threshold_scale=[0, 10, 20, 30, 40, 100],
             fill_opacity = 0.85,
             line_weight = 1.2,
             legend_name = "Occupied Rate(%)")
toggler = ScrollZoomToggler().add_to(m)
m
We want to see the distribution of the occupancy rate of DC Airbnb listings. In this map, a darker color indicates a higher average occupancy rate within the region. Each marker represents a single Airbnb listing, and the number on the marker is the percentage of the last 365 days for which the listing was occupied. Our guess was that central regions would have higher occupancy rates than outer regions: we assumed that most guests come for short-term visits and are therefore more likely to stay in central areas, which are more convenient and closer to places of interest. Looking at the map, however, we found that the occupancy rate varies from neighborhood to neighborhood, and central areas do not necessarily have a higher occupancy rate than the outer regions. We therefore look further into the data to see which elements influence the 'popularity' of a listing.
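The occupancy rate used throughout is derived from the `availability_365` field: days on which a listing is not available are counted as occupied. A minimal sketch of that calculation; the helper name `occupied_percent` and the sample values are illustrative assumptions, not part of the dataset:

```python
def occupied_percent(availability_365, days=365):
    """Percentage of days in the window on which the listing is not available."""
    if not 0 <= availability_365 <= days:
        raise ValueError("availability must lie between 0 and %d" % days)
    return (days - availability_365) / days * 100

# Hypothetical availability values
for available in (365, 292, 0):
    print(available, round(occupied_percent(available), 1))
```

A listing that is available on all 365 days therefore gets a rate of 0, and a listing that is never available gets 100.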
### price
m = folium.Map(DC_COORDINATES, zoom_start = 12)
marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)
for location in list_loc:
    folium.Marker(location = location[:2], icon=folium.Icon(color='blue'),
                  popup = str(location[3])).add_to(m).add_to(marker_cluster)
m.choropleth(geo_str = open(dc).read(),
             data=pop_price,
             columns=['neighbourhood_cleansed', 'price'],
             key_on='properties.neighbourhood',
             fill_color='OrRd',
             threshold_scale=[0, 50, 75, 100, 150, 300],
             fill_opacity = 0.85,
             line_weight = 1.2,
             legend_name = "Price")
toggler = ScrollZoomToggler().add_to(m)
m
To see the distribution of price more directly, we made a third map, in which the color indicates the price level within each area. As the legend shows, a darker color indicates a higher average price for the region. From the earlier map we concluded that the central part of DC might have higher prices and the outer regions lower prices. This map shows that this is not entirely true: the central region and the regions around the Potomac River have higher Airbnb prices, while the southeastern and northern regions of DC have lower Airbnb prices. Later we will use a multiple linear regression model to find out which elements are indicators of a listing's popularity.
#sns.distplot(df['occupied_percent'], bins=None, hist=True)
plt.hist(df['occupied_percent'], 50, density=True, facecolor='purple', alpha=0.75)
plt.title('Distribution of Occupied Percentage of Listings')
plt.xlabel('Occupied Rate')
plt.ylabel('Proportion')
plt.show()
This graph shows the distribution of the occupied percentage of the listings. The chart has 50 bars, and each bar covers 2 percentage points of the occupancy rate. As the chart shows, most listings have a low occupancy rate: around 7% of Airbnb listings had an occupancy rate of less than 2% within the last 365 days. The graph also shows that about 3% of the listings have an occupancy rate of almost 100%, which indicates that only a few listings in the DC region are extremely popular. Next, we compare the most popular and the least popular listings.
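Because the histogram is normalized to a density, the height of a bar is not itself a share of listings; the share is the height multiplied by the bin width (2 here). The share per bin can also be computed directly. A small sketch on made-up rates, where the function name `bin_shares` is hypothetical:

```python
def bin_shares(values, n_bins=50, low=0.0, high=100.0):
    """Return the percentage of values that fall in each equal-width bin."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The upper edge (100%) is assigned to the last bin
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    return [100.0 * c / total for c in counts]

# Made-up occupancy rates, not taken from the dataset
rates = [0.5, 1.0, 1.9, 35.0, 50.0, 99.0, 100.0, 100.0]
shares = bin_shares(rates)
print(shares[0], shares[-1])
```

Here `shares[0]` is the percentage of listings with an occupancy rate below 2%, and `shares[-1]` the percentage with a rate of 98% or more.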
We look at the listings with an occupancy rate of 75% or more and compare them with those with an occupancy rate of 10% or less.
above75 = df.query('occupied_percent >= 75')
below10 = df.query('occupied_percent <= 10')
plt.boxplot([above75['price'], below10['price']], 0 ,'')
plt.title('Distribution of Price of Listings')
plt.xlabel('Group')
plt.ylabel('Price')
plt.xticks(range(1,3),['Above 75%', 'Below 10%'])
plt.show()
The box plot above shows the price distribution for listings with an occupancy rate above 75% and for listings with an occupancy rate below 10%. Both groups have roughly the same median price, while the prices of the listings above 75% vary less than those of the listings below 10%. It is surprising that the group below 10% has the higher maximum price. Still, we cannot say that a lower price indicates higher popularity.
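The quantities read off a box plot (median, spread of the middle half, maximum) can also be computed numerically for each group. A minimal sketch with the standard library, assuming made-up prices; the helper name `price_summary` is hypothetical:

```python
import statistics

def price_summary(prices):
    """Median, interquartile range and maximum of a list of prices."""
    q1, q2, q3 = statistics.quantiles(prices, n=4, method='inclusive')
    return {'median': q2, 'iqr': q3 - q1, 'max': max(prices)}

# Made-up prices for the two groups, not taken from the dataset
popular = [80, 90, 100, 110, 120]
unpopular = [40, 70, 100, 150, 400]
print(price_summary(popular))
print(price_summary(unpopular))
```

In this toy example both groups share the same median, while the second group has a wider interquartile range and a higher maximum, which is the pattern described above.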
plt.hist(above75['accommodates'], 15, density=True, facecolor='blue', alpha=0.25, label='Above 75%')
plt.hist(below10['accommodates'], 15, density=True, facecolor='green', alpha=0.25, label='Below 10%')
plt.legend(loc='upper right')
plt.title('Distribution of Accommodates of Listings')
plt.xlabel('Accommodates')
plt.ylabel('Proportion')
plt.show()
In this graph, we look at the relationship between the maximum number of guests a listing accommodates and its popularity. Blue represents listings with an occupancy rate above 75%, and green represents listings below 10%. Both histograms have similar shapes and are skewed to the right, which indicates that few listings accommodate a large number of guests (more than 10 people). For listings that accommodate 2 guests, the proportion is higher in the group above 75% than in the group below 10%. We therefore conclude that popular listings typically accommodate 2 to 3 guests.
# Flatten the nested amenity lists of the popular listings into one list of names
above_list = sum(list(above75['amenities']), [])
result = sum(above_list, [])
result = pd.DataFrame(result)
result.reset_index(inplace=True)
# Count how often each amenity appears
ame_count = result.groupby(0).count()
ame_count.reset_index(inplace=True)
ame_count = ame_count.sort_values(by='index', ascending=False)
# Express each count as a percentage of the popular listings
ame_count['percent'] = ame_count['index'] / above75.shape[0] * 100
sns.barplot(y=0, x='percent', data =ame_count.head(15))
plt.title('Top Amenities in Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Amenities')
plt.show()
# Same steps for the less popular listings
below_list = sum(list(below10['amenities']), [])
result = sum(below_list, [])
result = pd.DataFrame(result)
result.reset_index(inplace=True)
ame_count_b = result.groupby(0).count()
ame_count_b.reset_index(inplace=True)
ame_count_b = ame_count_b.sort_values(by='index', ascending=False)
ame_count_b['percent'] = ame_count_b['index'] / below10.shape[0] * 100
sns.barplot(y=0, x='percent', data =ame_count_b.head(15))
plt.title('Top Amenities in Less Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Amenities')
plt.show()
Now we look at the relationship between amenities and popularity. The two bar charts above show the percentage of listings offering each amenity, one for popular listings (occupancy rate above 75%) and one for less popular listings (occupancy rate below 10%). The two charts have similar shapes, although the popular listings show a higher percentage of air conditioning and a lower percentage of heating. Overall, these two charts do not offer much useful information about what indicates popularity.
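The percentages in these charts come from splitting each listing's amenities string and counting how many listings mention each amenity. A minimal sketch in plain Python, assuming the brace-delimited format that the `changeAmenity` function in this notebook strips; the function names and sample strings are hypothetical:

```python
from collections import Counter

def parse_amenities(raw):
    """Turn a string such as '{TV,"Air Conditioning"}' into a list of names."""
    inner = str(raw).strip('{}')
    if not inner:
        return []
    return [item.strip('"') for item in inner.split(',')]

def amenity_percent(raw_entries):
    """Percentage of listings that mention each amenity, most common first."""
    counts = Counter()
    for raw in raw_entries:
        # A set ensures each amenity is counted once per listing
        counts.update(set(parse_amenities(raw)))
    n = len(raw_entries)
    return {name: 100.0 * c / n for name, c in counts.most_common()}

# Made-up amenity strings, not taken from the dataset
sample = ['{TV,"Air Conditioning",Kitchen}', '{"Air Conditioning",Heating}', '{}']
percent = amenity_percent(sample)
print(percent)
```

Note that splitting on commas would break an amenity name that itself contains a comma; the sketch assumes no such names occur.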
# Share of the popular listings (occupancy >= 75%) found in each neighbourhood
neighbour1 = above75.groupby('neighbourhood').count()
neighbour1 = neighbour1.sort_values(by = 'name', ascending = False)
neighbour1['name'] = neighbour1['name']/ above75.shape[0] *100
neighbour1.reset_index(inplace=True)
#neighbour1.head(10)
# Share of the less popular listings (occupancy <= 10%) found in each neighbourhood
neighbour2 = below10.groupby('neighbourhood').count()
neighbour2 = neighbour2.sort_values(by = 'name', ascending = False)
neighbour2['name'] = neighbour2['name']/ below10.shape[0] * 100
neighbour2.reset_index(inplace=True)
#neighbour2.head(10)
sns.barplot(x='name', y='neighbourhood', data =neighbour1.head(10))
plt.title('Top Regions With Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Neighbourhood')
plt.show()
sns.barplot(x='name', y='neighbourhood', data =neighbour2.head(10))
plt.title('Top Regions With Less Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Neighbourhood')
plt.show()
The two graphs above show the relationship between region and the popularity of a listing. The first graph, which shows the top regions for popular listings, indicates that the "Columbia Heights" region has the most popular listings and the Kalorama region has few. The second graph shows that the "Capitol Hill" region has many unpopular listings, and that "Columbia Heights" ranks second for unpopular listings. A single region can therefore contain many popular and many unpopular listings at the same time. We need to look into other elements to see whether any single indicator decides the popularity of a listing.
# above75 description Wordcloud
description = above75.dropna(subset = ['description'])
words = str.join(' ', description.description)
stop = set(stopwords.words('english'))
stop.add('room')
# Write the text out and close the file before reading it back
with open('description.txt', 'w') as f:
    f.write(words)
logo = np.array(Image.open("Airbnb-logo.png"))
image_colors = ImageColorGenerator(logo)
with open('description.txt') as f:
    text = f.read()
wc = WordCloud(stopwords=stop).generate(text)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()
# below10 description Wordcloud
description = below10.dropna(subset = ['description'])
words = str.join(' ', description.description)
stop = set(stopwords.words('english'))
stop.add('room')
# Write the text out and close the file before reading it back
with open('description.txt', 'w') as f:
    f.write(words)
logo = np.array(Image.open("Airbnb-logo.png"))
image_colors = ImageColorGenerator(logo)
with open('description.txt') as f:
    text = f.read()
wc = WordCloud(stopwords=stop).generate(text)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()
Now we want to see whether the words in the description influence the popularity of a listing. As shown above, words such as “apartment”, “bedroom” and “metro” are frequently used in both popular and less popular listings. In each word cloud, the bigger a word is, the more frequently it appears in the ‘description’ section. Comparing the two word clouds, the unpopular listings tend to mention ‘metro’ and ‘kitchen’ more. Overall, popular and less popular listings use similar words in the ‘description’ section. Later we will make another word cloud to show which words are used most in the ‘description’ section overall.
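A word cloud sizes each word by how often it occurs after stop words are removed, so the two groups can also be compared by listing their most frequent words. A minimal sketch in plain Python; the stop-word set here is a tiny illustrative stand-in for the NLTK list, and the function name and sample descriptions are made up:

```python
from collections import Counter
import re

# Tiny illustrative stop list; the notebook uses NLTK's English stop words plus 'room'
STOP = {'the', 'a', 'and', 'to', 'in', 'of', 'is', 'room'}

def top_words(descriptions, stop=STOP, n=3):
    """Return the n most frequent non-stop words with their counts."""
    words = []
    for text in descriptions:
        tokens = re.findall(r"[a-z']+", text.lower())
        words.extend(w for w in tokens if w not in stop)
    return Counter(words).most_common(n)

# Made-up descriptions, not taken from the dataset
sample = ["Cozy apartment close to the metro",
          "Bright apartment, two bedroom, near metro",
          "Private room in apartment"]
print(top_words(sample, n=2))
```

Running the same function on the descriptions of each group gives two ranked lists that can be compared word by word.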
# Get the average price of each neighbourhood (numeric columns only)
neighbourhood =df.dropna(subset = ['neighbourhood'])
neighbourhood_price = neighbourhood.groupby('neighbourhood').mean(numeric_only=True)
neighbourhood_price['neighbourhood'] = neighbourhood_price.index
neighbourhood_price = neighbourhood_price.sort_values('price', ascending = False)
# Draw the bar chart
sns.barplot(y='neighbourhood', x='price', data = neighbourhood_price[:10])
plt.title('Top 10 Neighbourhoods with Highest Average Room Price')
plt.xlabel('Price')
plt.ylabel('Neighbourhood')
plt.show()
As this bar chart shows, Peasant Hill is the most expensive place listed on Airbnb in Washington, DC. Its average price is almost \$600, possibly because more single-family houses are listed in that area. The second highest average price is in Hillcrest, at almost \$400 per night. Looking at the places with high prices, we realize that these neighbourhoods are not close to downtown. They are either on the northeast side of DC or close to hilly areas, which could be an indicator of local residential patterns in DC. Generally, the places listed at a higher price are not close to downtown, so their main target customers are not tourists.
# Stacked bar: share of each room type in the ten most expensive neighbourhoods
Top10Neigh = neighbourhood_price[:10]['neighbourhood']
neighbourhood_no = neighbourhood.groupby(['neighbourhood', 'room_type']).count()
neighbourhood_no= neighbourhood_no[['listing_url']]
neighbourhood_no = neighbourhood_no.unstack(level=1)
# Keep only the top-10 neighbourhoods
neighbourhood_no = neighbourhood_no.loc[neighbourhood_no.index.isin(Top10Neigh), :]
# Convert counts to row-wise proportions; a room type with no listings counts as 0
neighbourhood_no = neighbourhood_no.fillna(0)
neighbourhood_no = neighbourhood_no.div(neighbourhood_no.sum(axis=1), axis=0)
# Columns follow the alphabetical order of room_type: entire home, private room, shared room
entire = neighbourhood_no.iloc[:, 0].values
private = neighbourhood_no.iloc[:, 1].values
shared = neighbourhood_no.iloc[:, 2].values
ypos = np.arange(neighbourhood_no.shape[0])
p1 = plt.barh(ypos, entire, color='#F9AA8D')
p2 = plt.barh(ypos, private, left=entire, color='#F7C78E')
p3 = plt.barh(ypos, shared, left=entire + private, color='#F7DD8C')
plt.xlabel('Percentage')
plt.ylabel('Neighbourhoods')
plt.yticks(np.arange(neighbourhood_no.shape[0]), neighbourhood_no.index.values)
plt.title('Three Room Types Distribution of the Top 10 Highest Price Neighbourhoods', y=1.08)
plt.legend((p1[0], p2[0],p3[0]), ('Entire', 'Private', 'Shared'),bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
This stacked plot shows the pattern of room types in these most expensive places. Most of the properties are listed to be booked as an entire home. A small portion of the properties are listed as private rooms that share public space with the hosts. Almost no rooms are listed as shared rooms. This supports our speculation above that the main target of these properties is high-end customers rather than guests who come to DC as tourists.
description = df.dropna(subset = ['description'])
words = str.join(' ', description.description)
stop = set(stopwords.words('english'))
stop.add('room')
# Write the text out and close the file before reading it back
with open('description.txt', 'w') as f:
    f.write(words)
with open('description.txt') as f:
    text = f.read()
logo = np.array(Image.open("Airbnb-logo.png"))
wc = WordCloud(background_color="white", max_words=500, mask=logo, stopwords=stop)
# generate word cloud
wc.generate(text)
image_colors = ImageColorGenerator(logo)
wc.recolor(color_func=image_colors)
# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Lastly, we make a word cloud that shows which words are frequently used in the ‘description’ section overall. The word cloud above shows that words such as “apartment”, “bedroom”, “metro” and “kitchen” are frequently used in the “description” section.
rcode = """occupied_percent ~ price + host_response_rate +
entire_home + private_room + guests_included + bathrooms +
bedrooms + cancel_strict +
number_of_reviews + review_scores_location +
review_scores_value"""
lm = smf.ols(formula= rcode, data=df).fit()
print(lm.summary())
To produce quantitative results on the factors associated with the popularity of an Airbnb listing, we use a multiple linear regression model to measure each factor's contribution to the popularity of a room listed on Airbnb. Eventually, we arrived at the model above.
The factors that we thought would be influential included price, room type, host response rate, superhost status, whether the host has a profile picture, number of people allowed, bathrooms, bedrooms and beds, the amount of the security deposit and cleaning fee, cancellation policy, number of reviews, and review scores on different aspects. In the end, we selected the model stated above, which keeps the parameters that have a significant impact on the popularity of a room. We chose the percentage of days occupied within a year as the measure of a room's popularity. It is continuous, so we decided to use multiple linear regression to estimate the impact.
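The least-squares idea behind the regression can be illustrated with a single predictor. The sketch below is a simplification under stated assumptions: it fits one made-up predictor in plain Python rather than the multi-variable statsmodels formula used in this project, and the function name `fit_line` and the numbers are hypothetical:

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums of the centred data
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up numbers: number of reviews and occupancy rate (%) for five listings
reviews = [0, 10, 20, 30, 40]
occupancy = [10, 15, 20, 25, 30]
intercept, slope = fit_line(reviews, occupancy)
print(intercept, slope)
```

A positive slope means the predictor is associated with a higher occupancy rate; in the multi-variable model each coefficient has the same reading, holding the other predictors fixed.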
Based on the model result, the significant variables are: price, host response rate, whether the property is listed as an entire home, whether the room is private, number of guests included, number of bathrooms, number of bedrooms, cancellation policy, number of reviews, and the review scores for location and value. Among these variables, some positive factors include:
Meanwhile, some negative factors contributing to this model include:
From the result above, we gain some insight into guests' decision-making process and the points they pay attention to. Price is obviously an important factor, as people like to save money. At the same time, guests also care about privacy: rooms with more privacy are more popular than shared rooms. The result suggests that people traveling to DC tend to be in small groups of two to three people, so a larger room is not as desirable as one that holds just the right number for the group. Guests also consider bathrooms a crucial feature of a room listed on Airbnb, and they are more likely to book a room with more bathrooms and bedrooms. Most importantly, connections formed on Airbnb are also a factor influencing people's decisions: higher review scores for value and location attract guests.
The coefficients of the factors most likely to be influential make sense in our result. Yet the R-squared is small: the model explains only 7.7% of the variance in the occupancy rate. This means that we have missed some significant factors that are not in our data set or not in the model. For example, location is an important part of the decision-making process when a tourist chooses a place to stay. It is excluded from the linear regression model because a categorical variable with over 30 levels is hard to interpret in a linear model; including neighbourhood would require a more advanced model to quantify the effect. Another factor that is missing from the model is the pictures posted on the Airbnb website. The picture of a room is a guest's first impression and is likely to affect people's decisions, since people generally pay more attention to pictures than to words. With only a URL link to the picture, we are unable to quantify the quality of an image of a property.
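R-squared compares the model's squared prediction errors with those of a baseline that always predicts the mean. A minimal sketch in plain Python on made-up values; the function name `r_squared` and the numbers are illustrative and unrelated to the fitted model:

```python
def r_squared(observed, predicted):
    """Share of the variance in observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up occupancy rates and predictions
observed = [10, 20, 30, 40]
good_fit = [12, 18, 33, 37]
mean_only = [25, 25, 25, 25]
print(r_squared(observed, good_fit))
print(r_squared(observed, mean_only))
```

Predictions that are no better than the mean give a value of 0, so a value of 0.077 indicates that the model improves only slightly on that baseline.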
In this project, our main goal was to find the characteristics of popular rooms listed on Airbnb. We first looked into the distribution of rooms in Washington, DC. The rooms that are more popular and more expensive are mainly near the center of Washington, DC. We then inspected two groups of rooms, one popular and the other rarely booked. The results show that the description presented online is not the main difference between these two groups. The model results also show that several factors positively or negatively affect the popularity of a room listed on Airbnb. More research is needed, as we believe that some important factors are missing from our model.